
Conversation

@varun-sundar-rabindranath
Contributor

@varun-sundar-rabindranath varun-sundar-rabindranath commented Nov 21, 2025

Purpose

On B200, the following command fails,

VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1  vllm serve  Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel --no-enable-prefix-caching  --port 9010 --enable-eplb  --eplb-config '{"window_size":10,"step_interval":100,"num_redundant_experts":0,"log_balancedness":true}' --enforce-eager

On B200, when we use DeepGEMM, we transpose the MoE weight_scale tensors for efficient DeepGEMM matmuls. This, in combination with EPLB, causes the following assertion to fail,

        assert all(
            weight.is_contiguous()
            for name, weight in weights
            if not name.startswith("_shared_experts.")
        )

as the weight scale tensors are not contiguous.

Fix: This PR changes the view of the tensor so the is_contiguous() check passes. We also add a test to verify that this view update is safe.
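
For context, a minimal sketch of the failure and the fix idea, assuming a 3D (num_experts, M, N) scale tensor whose last two dims are transposed for DeepGEMM; the helper below is a simplified, hypothetical stand-in for the PR's `_maybe_make_contiguous`, not the actual vLLM code:

```python
import torch

num_experts, m, n = 4, 8, 16

# The DeepGEMM path keeps the per-expert weight scales transposed. The
# transpose is only a stride trick, so the tensor reports non-contiguous
# even though the underlying storage is dense.
w_scale = torch.randn(num_experts, m, n).transpose(1, 2)
assert not w_scale.is_contiguous()  # this is what trips the EPLB assert

def maybe_make_contiguous(t: torch.Tensor) -> torch.Tensor:
    """Simplified, hypothetical stand-in for the PR's _maybe_make_contiguous."""
    if t.is_contiguous():
        return t
    # Undo the stride trick: the result aliases the same storage, keeps the
    # expert dim first, and is contiguous. EPLB only shuffles whole expert
    # slices along dim 0, so the inner logical layout does not matter.
    return t.transpose(1, 2)

fixed = maybe_make_contiguous(w_scale)
assert fixed.is_contiguous()
assert fixed.data_ptr() == w_scale.data_ptr()  # no copy, same storage
```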

Test Plan

tests/distributed/test_eplb_fused_moe_layer.py passes

lm-eval test

server command: VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1  vllm serve  Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel --no-enable-prefix-caching  --port 9010
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://localhost:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100

Test Result

server command + --enable-eplb --eplb-config '{"window_size":10,"step_interval":100,"num_redundant_experts":0,"log_balancedness":true}'

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.88|±  |0.0327|
|     |       |strict-match    |     5|exact_match|↑  | 0.92|±  |0.0273|

server command (without eplb)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.87|±  |0.0338|
|     |       |strict-match    |     5|exact_match|↑  | 0.89|±  |0.0314|

@mergify

mergify bot commented Nov 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 21, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug that occurs on B200 hardware when using DeepGEMM in combination with EPLB, where transposed, non-contiguous MoE weight scale tensors cause an assertion failure. The fix involves changing the tensor's view to be contiguous before it's processed by the EPLB logic. The PR also introduces a new test to validate the fix and refactors some distributed testing utilities into a shared file, which is a good improvement. The overall approach is sound and the changes are well-implemented. I have one minor suggestion to improve the clarity of a docstring for future maintainability.

@varun-sundar-rabindranath
Contributor Author

@codex review

@varun-sundar-rabindranath
Contributor Author

cc @elvircrn @ilmarkov @SageMoore @abmfy PTAL! Thanks 🙌

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Hooray!


@ilmarkov
Contributor

Looks good to me! Thank you for the fix!


weights = list(self.named_parameters())
weights = [(name, _maybe_make_contiguous(name, p)) for name, p in weights]


Probably, instead of the is_contiguous check, we should check that the tensor is row-major with num_local_experts in the first dimension. Whether the tensor is contiguous in the other dimensions does not matter, as we flatten the view over those dimensions.
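
A minimal sketch of what such a check might look like, under the assumption that what EPLB really needs is for each expert's data to occupy one dense slab along dim 0. The helper name is hypothetical, and a complete check would also have to verify that the inner dims tile the slab without gaps or overlap:

```python
import torch

def is_expert_major(t: torch.Tensor, num_local_experts: int) -> bool:
    # Each expert must own one dense, non-overlapping slab along dim 0 so
    # that whole expert slices can be moved as flat chunks of memory.
    if t.dim() < 2 or t.size(0) != num_local_experts:
        return False
    elems_per_expert = t.numel() // num_local_experts
    return t.stride(0) == elems_per_expert

num_experts, m, n = 4, 8, 16
transposed_scale = torch.randn(num_experts, m, n).transpose(1, 2)
assert not transposed_scale.is_contiguous()           # fails the current check
assert is_expert_major(transposed_scale, num_experts)  # but is still expert-major
```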

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 21, 2025
@varun-sundar-rabindranath
Contributor Author

Looks like pre-commit is failing on unrelated files in vLLM PRs; saw the same failure here: #29188

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Member

@abmfy abmfy left a comment


The current fix LGTM. Thanks!

@abmfy
Member

abmfy commented Nov 21, 2025

The point of the assertion is to ensure that .view(num_experts, -1) does not fail. I have not actually tested this, but as @ilmarkov suggested, we could simply check whether the tensor is row-major and then remove that view operation entirely.
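
A small illustration of that point: on the transposed (non-contiguous) scales, .view(num_experts, -1) raises, which is exactly what the assertion guards against. Simplified shapes, not vLLM code:

```python
import torch

num_experts, m, n = 4, 8, 16
# Non-contiguous, as in the DeepGEMM path on B200.
scale = torch.randn(num_experts, m, n).transpose(1, 2)

try:
    scale.view(num_experts, -1)  # the operation the assertion protects
except RuntimeError as err:
    print(f"view() fails on the transposed scales: {err}")
```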

@vllm-bot vllm-bot merged commit 3137991 into vllm-project:main Nov 21, 2025
52 of 55 checks passed

Labels

ready ONLY add when PR is ready to merge/full CI is needed
